Project 1 | Unsupervised Learning

The purpose of this case study is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars. (In this dataset the two cars are merged into a single `car` label, so the target column has three classes: bus, car and van.)

In [1]:
%matplotlib inline
## Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_color_codes()
In [2]:
# load the dataset

sh_df = pd.read_csv('vehicle.csv')
sh_df.head()
Out[2]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [3]:
# Shape of the dataset
sh_df.shape
Out[3]:
(846, 19)

1. Data pre-processing - Understand the data and treat missing values (Use box plot), outliers (5 points)

1a - Understand the data

1b - Find missing values

1c - Treat missing values

1d - Find outliers

1e - Treat outliers

1a - Understand the data

In [4]:
sh_df.info()
# Distribution of data types -
# dtypes: float64(14), int64(4), object(1)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
In [5]:
sh_df.describe(include='all').transpose()
Out[5]:
count unique top freq mean std min 25% 50% 75% max
compactness 846 NaN NaN NaN 93.6785 8.23447 73 87 93 100 119
circularity 841 NaN NaN NaN 44.8288 6.15217 33 40 44 49 59
distance_circularity 842 NaN NaN NaN 82.1105 15.7783 40 70 80 98 112
radius_ratio 840 NaN NaN NaN 168.888 33.5202 104 141 167 195 333
pr.axis_aspect_ratio 844 NaN NaN NaN 61.6789 7.89146 47 57 61 65 138
max.length_aspect_ratio 846 NaN NaN NaN 8.56738 4.60122 2 7 8 10 55
scatter_ratio 845 NaN NaN NaN 168.902 33.2148 112 147 157 198 265
elongatedness 845 NaN NaN NaN 40.9337 7.81619 26 33 43 46 61
pr.axis_rectangularity 843 NaN NaN NaN 20.5824 2.59293 17 19 20 23 29
max.length_rectangularity 846 NaN NaN NaN 147.999 14.5157 118 137 146 159 188
scaled_variance 843 NaN NaN NaN 188.631 31.411 130 167 179 217 320
scaled_variance.1 844 NaN NaN NaN 439.494 176.667 184 318 363.5 587 1018
scaled_radius_of_gyration 844 NaN NaN NaN 174.71 32.5848 109 149 173.5 198 268
scaled_radius_of_gyration.1 842 NaN NaN NaN 72.4477 7.48619 59 67 71.5 75 135
skewness_about 840 NaN NaN NaN 6.36429 4.92065 0 2 6 9 22
skewness_about.1 845 NaN NaN NaN 12.6024 8.93608 0 5 11 19 41
skewness_about.2 845 NaN NaN NaN 188.92 6.15581 176 184 188 193 206
hollows_ratio 846 NaN NaN NaN 195.632 7.4388 181 190.25 197 201 211
class 846 3 car 429 NaN NaN NaN NaN NaN NaN NaN
In [7]:
# Finding Unique values
for i in sh_df.columns:
    print(i ,':', sh_df[i].unique() )
compactness : [ 95  91 104  93  85 107  97  90  86  88  89  94  96  99 101  84  87  83
 102  80 100  82 106  81 119  78  92  98 103  77  73  79 110 108 109 111
 105 112 116 113 117 115  76 114]
circularity : [48. 41. 50. 44. nan 43. 34. 36. 46. 42. 49. 55. 54. 56. 47. 37. 39. 53.
 45. 38. 35. 40. 59. 52. 51. 58. 57. 33.]
distance_circularity : [ 83.  84. 106.  82.  70.  73.  66.  62.  98.  74.  85.  79. 103.  51.
  77. 100.  75.  53.  64. 105.  80.  54.  63. 107.  nan  72.  86.  68.
 104.  87.  76.  81.  71. 101.  96.  78. 108.  91.  89.  94.  92.  60.
  57.  65.  50.  88. 109.  95.  90.  58.  69.  47.  40.  59. 110.  93.
 102. 112.  61.  42.  49.  44.  52.  55.]
radius_ratio : [178. 141. 209. 159. 205. 172. 173. 157. 140.  nan 143. 136. 171. 144.
 203. 201. 109. 197. 186. 215. 153. 121. 148. 219. 154. 119. 193. 129.
 160. 151. 222. 177. 118. 306. 176. 169. 214. 105. 137. 183. 220. 145.
 133. 122. 147. 115. 174. 228. 175. 185. 195. 221. 212. 135. 120. 156.
 125. 164. 161. 227. 191. 111. 170. 113. 127. 188. 180. 116. 158. 162.
 211. 152. 124. 252. 150. 130. 198. 202. 199. 128. 142. 163. 155. 184.
 165. 322. 194. 218. 216. 223. 149. 131. 139. 179. 166. 187. 167. 231.
 168. 126. 206. 210. 110. 189. 134. 132. 230. 196. 208. 138. 200. 225.
 246. 207. 192. 117. 123. 146. 190. 182. 204. 224. 333. 213. 226. 238.
 181. 114. 104. 112. 234. 235. 250. 232. 217.]
pr.axis_aspect_ratio : [ 72.  57.  66.  63. 103.  50.  65.  61.  62.  55.  68.  58.  71.  52.
  69.  nan  64.  59.  67. 126.  54.  73.  51.  70.  53.  56.  74.  60.
  76.  75.  49. 133.  47. 102. 138.  48.  97. 105.]
max.length_aspect_ratio : [10  9 52  6  7 11  5  8 49 12 22 48  4 13  3 43 25 46 19  2 55]
scatter_ratio : [162. 149. 207. 144. 255. 153. 137. 122. 183. 133. 123. 152. 174. 204.
 118. 177. 216. 208. 154. 150. 143. 147. 128. 218. 192. 146. 155. 140.
 142. 164. 157. 151. 205. 119. 158. 213. 159. 130. 148. 156. 163. 210.
 257. 185. 209. 193. 184. 225. 190. 215. 224. 176. 126. 195. 172. 127.
 261. 171. 125. 169. 197. 145. 214. 201. 114. 194. 220. 221. 173. 134.
 160. 121. 199. 186. 161. 222. 136. 187. 250. 202. 166. 132. 211. 247.
 116. 203. 181. 240. 219. 212. 138. 131. 117. 112. 165.  nan 226. 129.
 175. 135. 217. 200. 241. 198. 120. 260. 188. 170. 234. 236. 168. 167.
 227. 139. 256. 206. 141. 238. 178. 223. 252. 239. 115. 191. 179. 251.
 262. 196. 180. 189. 237. 265.]
elongatedness : [42. 45. 32. 46. 26. 48. 54. 36. 50. 43. 44. 37. 57. 31. 55. 53. 33. 47.
 40. 51. 34. 52. 35. 30. 38. 56. 39. 58. 59. 49. 27. 41. nan 61. 28. 29.]
pr.axis_rectangularity : [20. 19. 23. 28. 18. 17. 22. 21. 24. 25. nan 27. 26. 29.]
max.length_rectangularity : [159 143 158 144 169 146 127 130 118 148 154 166 129 139 173 145 125 142
 136 165 167 151 128 150 147 156 171 162 134 160 141 163 133 168 135 161
 137 178 175 186 132 138 152 153 122 157 140 174 124 164 172 170 131 123
 149 126 155 176 180 119 177 121 182 179 188 120]
scaled_variance : [176. 170. 223. 160. 241. 280. 162. 141. 202. 153. 148. 180. 173. 196.
 227. 137. 225. 175. 169. 164. 221. 143. 229. 217. 168. 165. 232. 186.
 174. 272. 235. 135. 178. 191. 159. 172. 184. 181. 236. 275. 154. 222.
 214. 145. 203. 231. 208. 226. 210. 197. 171. 155. 278. 189. 142. 218.
 179. 166. 237.  nan 212. 177. 167. 132. 151. 216. 219. 157. 224. 188.
 161. 207. 156. 152. 220. 266. 228. 185. 209. 204. 182. 200. 258. 146.
 183. 163. 238. 194. 134. 206. 136. 130. 190. 158. 147. 140. 265. 211.
 138. 247. 288. 234. 243. 256. 195. 213. 187. 205. 262. 320. 285. 215.
 150. 139. 267. 149. 193. 230. 254. 269. 264. 199. 192. 144. 131. 246.
 287. 240. 263.]
scaled_variance.1 : [ 379.  330.  635.  309.  325.  957.  361.  281.  223.  505.  266.  224.
  349.  345.  465.  624.  206.  485.  686.  651.  354.  221.  344.  307.
  623.  324.  238.  696.  570.  314.  356.  293.  304.  641.  402.  363.
  340.  346.  691.  336.  628.  207.  366.  405.  675.  371.  253.  317.
  352.  404.  299.  355.  661.  341.  956.  265.  512.  653.  241.  567.
  247.  269.  333.  523.  323.  748.  305.  558.  683.  732.  466.  227.
  338.  571.  445.  666.  328.  343.  671.  242.  311.  342.  998.  209.
  446.  229.  703.  430.  583.  312.  308.  337.  602.  321.  326.  347.
  246.  194.  576.  711.  575.  331.  329.  524.  357.  315.  192.  351.
  611.  712.  463.  370.  319.  216.  365.  605.  578.  511.  261.  669.
  364.  264.  230.  373.  320.  670.  406.  728.  387.  332.  360.  279.
  527.  525.  334.  645.  928.  240.  335.  259.  610.  415.  260.  665.
  707.  674.  243.  892.  313.  680.  469.  273.  638.  196.  612.  479.
  434.  367.  494.  866.  705.  350.  727.  225.  396.  681.  444.  284.
  682.  286.  388.  708.  195.  722.  531.  274.  704.  258.  640.  203.
  184.  642.  268.  419.  731.  684.  358.  374.  426.  673.  197.  716.
  310.  676.  604.  245.  457.  237.  518.  212.  218.  322.  452.  316.
  954.  327.  362.  521.  692.  427.  598.  251.  870.  252.  385.   nan
  492.  589.  249.  520.  213.  718.  982.  533.  435.  389.  300.  530.
  725.  822.  720.  625.  290.  271.  737.  833.  668.  546.  472.  504.
  519.  687.  425.  608.  399.  359.  429.  205.  776.  694.  923.  296.
  756.  348.  467.  391.  966.  413.  627.  294.  595.  433.  390.  287.
  685.  263.  255.  650.  298.  710.  428.  639.  280.  629.  262.  473.
  586.  572.  376.  211.  517.  844.  636.  394.  607.  282.  706.  480.
  735.  658.  741.  697.  418.  301.  383.  663.  709.  891.  204.  757.
  573.  667.  372.  678.  395.  729.  278.  700.  339.  471.  275.  291.
  701.  381.  526.  630.  460.  904.  637.  719.  601.  857.  375.  378.
  277.  208.  295.  458.  613.  855.  698.  713.  416.  688.  648.  369.
  677.  600.  409.  232.  622.  200.  250.  481.  534.  422.  462.  191.
  393.  193.  219.  486.  664.  306.  644.  616.  297.  487.  562.  543.
  484.  318.  440.  289.  561.  513.  659.  455.  596.  353.  752.  579.
  414.  987.  220.  545.  489.  417.  693.  438.  657.  222.  474.  766.
  587.  459.  408.  557.  256.  233.  633.  730.  816.  968.  621.  584.
  506.  401.  559.  272.  660.  574.  536.  450.  552.  838.  535.  285.
  631.  563.  476.  508.  597.  717.  478.  726.  254.  382. 1018.  283.
  368.  721.  270.]
scaled_radius_of_gyration : [184. 158. 220. 127. 188. 264. 172. 164. 112. 152. 118. 192. 161. 206.
 246. 125. 151. 223. 133. 177. 141. 224. 174. 139. 216. 163. 120. 204.
 130. 200. 218. 186. 202. 232. 189. 156. 179. 146. 245. 165. 230. 119.
 176. 212. 142. 185. 159. 183. 171. 209. 221. 147. 210. 173. 214. 116.
 144. 148. 257. 162. 129. 132. 229. 155. 234. 170. 201. 150. 137. 167.
 143. 187. 136. 138. 194. 153. 157. 198. 131. 168. 134. 145. 115. 178.
 199. 195. 154. 149. 239. 140. 180. 217. 124. 190. 242. 169. 135. 166.
 238. 121. 128. 191. 123. 219. 181.  nan 213. 175. 197. 211. 126. 249.
 203. 205. 160. 222. 247. 226. 261. 231. 236. 253. 235. 182. 262. 193.
 117. 250. 196. 243. 113. 228. 240. 241. 244. 109. 207. 268. 260. 215.
 208. 237. 255. 114.]
scaled_radius_of_gyration.1 : [ 70.  72.  73.  63. 127.  85.  66.  67.  64.  65.  71.  74.  80.  75.
  82.  68.  69.  76.  83. 118.  86.  77.  88.  62.  79.  78.  nan  81.
  87. 119.  97.  60.  61.  89.  90.  84. 135.  91.  59.  99.]
skewness_about : [ 6.  9. 14.  5. 13.  3.  2.  4.  8.  0.  7.  1. 10. 17. 20. 18. nan 11.
 16. 21. 12. 22. 15. 19.]
skewness_about.1 : [16. 14.  9. 10. 11.  1.  3. 26. 13.  2.  5.  6.  4. 28.  7. 20. 38. 25.
 15.  0.  8. 24. 21. 18. 23. 30. 12. 29. 27. nan 33. 32. 41. 39. 17. 22.
 19. 35. 31. 36. 40. 34.]
skewness_about.2 : [187. 189. 188. 199. 180. 181. 200. 193. 195. 194. 196. 197. 186. 198.
 185. 179. 192. 191. 190. 183. 184. 202. 201. 182. 176. 178. 203. 177.
  nan 204. 206.]
hollows_ratio : [197 199 196 207 183 204 202 208 195 194 185 193 192 206 201 205 200 189
 182 209 184 187 188 198 191 190 203 186 210 211 181]
class : ['van' 'car' 'bus']
In [8]:
sh_df['class'].value_counts().plot(kind = 'bar')
plt.title('Bar Plot for Class variable')
plt.xlabel('Class')
plt.ylabel('Value Counts')
plt.show()
In [86]:
sh_df['class'].value_counts()
Out[86]:
car    429
bus    218
van    199
Name: class, dtype: int64

1b - Find missing values

In [9]:
sh_df.isnull().sum()
Out[9]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
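In pandas, isna() is an alias of isnull(), so both checks should always agree; a minimal sketch on a hypothetical toy frame:

```python
import pandas as pd
import numpy as np

# Toy frame with one NaN in a float column and one None in an object column
toy = pd.DataFrame({'x': [1.0, np.nan], 'y': ['a', None]})

# isna() is an alias of isnull(); both flag NaN and None the same way
same = toy.isnull().equals(toy.isna())
counts = toy.isna().sum()
```

Because the two methods are aliases, running both on `sh_df` is a sanity check rather than a different technique.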

The isnull() check above already reports the per-column missing-value counts; we cross-check with isna() below, which is an alias and should agree.

In [11]:
sh_df.isna().sum()
Out[11]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64

Both checks agree on the per-column missing-value counts, so we can proceed to treatment.

1c - Treat missing values

In [13]:
# isnull() and isna() report the same missing-value counts.
# Fill missing values with the column mean (reasonable given the distributions seen in describe()).
# numeric_only=True skips the object 'class' column (required on newer pandas versions).

sh_df = sh_df.fillna(sh_df.mean(numeric_only=True))
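The mean-imputation step can be sketched on a hypothetical toy frame; `numeric_only=True` keeps `mean()` from choking on object columns such as `class` on newer pandas versions:

```python
import pandas as pd
import numpy as np

# Toy frame: a numeric column with a NaN and an object column, mirroring sh_df's mix
toy = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'label': ['a', 'b', 'c']})

# Fill numeric NaNs with the column mean; numeric_only=True skips 'label'
toy = toy.fillna(toy.mean(numeric_only=True))
# The NaN in 'x' becomes mean(1.0, 3.0) = 2.0
```

Note that mean imputation pulls filled values toward the center of each column, which slightly shrinks the variance; for a small number of missing cells (at most 6 per column here) the effect is negligible.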
In [14]:
sh_df.isnull().sum()
Out[14]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64
In [15]:
## Verify after imputing

for col in sh_df.columns:
    print('# Missing values for col \'{}\': {}'.format(col, sh_df[col].isna().sum()))
# Missing values for col 'compactness': 0
# Missing values for col 'circularity': 0
# Missing values for col 'distance_circularity': 0
# Missing values for col 'radius_ratio': 0
# Missing values for col 'pr.axis_aspect_ratio': 0
# Missing values for col 'max.length_aspect_ratio': 0
# Missing values for col 'scatter_ratio': 0
# Missing values for col 'elongatedness': 0
# Missing values for col 'pr.axis_rectangularity': 0
# Missing values for col 'max.length_rectangularity': 0
# Missing values for col 'scaled_variance': 0
# Missing values for col 'scaled_variance.1': 0
# Missing values for col 'scaled_radius_of_gyration': 0
# Missing values for col 'scaled_radius_of_gyration.1': 0
# Missing values for col 'skewness_about': 0
# Missing values for col 'skewness_about.1': 0
# Missing values for col 'skewness_about.2': 0
# Missing values for col 'hollows_ratio': 0
# Missing values for col 'class': 0

Univariate and Bivariate Analysis

In [16]:
sns.distplot(sh_df['compactness'])
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x227c8441400>
In [81]:
for i in sh_df.columns:
    sh_df.groupby(i)['class'].value_counts().unstack().plot(kind = 'bar', stacked = True, figsize = (8,6))
In [72]:
sns.pairplot(sh_df)
Out[72]:
<seaborn.axisgrid.PairGrid at 0x1cf831959e8>
In [51]:
sns.pairplot(data=sh_df, diag_kind='kde', hue='class')
Out[51]:
<seaborn.axisgrid.PairGrid at 0x16d621b3898>
In [87]:
sns.set()
sns.pairplot(sh_df, height = 2.0)  # 'size' was renamed to 'height' in seaborn 0.9
plt.show()
In [19]:
sh_df.corr()
Out[19]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
compactness 1.000000 0.685421 0.789909 0.689840 0.091704 0.148249 0.812235 -0.788643 0.813636 0.676143 0.762770 0.815901 0.585156 -0.250071 0.235687 0.157387 0.298526 0.365552
circularity 0.685421 1.000000 0.793016 0.620967 0.153362 0.251208 0.848207 -0.821901 0.844972 0.961943 0.796822 0.838525 0.926888 0.052642 0.144394 -0.011851 -0.105645 0.045318
distance_circularity 0.789909 0.793016 1.000000 0.767079 0.158397 0.264550 0.904400 -0.911435 0.893128 0.774669 0.861980 0.887328 0.705953 -0.225852 0.113813 0.265553 0.145563 0.332095
radius_ratio 0.689840 0.620967 0.767079 1.000000 0.663559 0.450036 0.734228 -0.789795 0.708285 0.569205 0.794041 0.720150 0.536536 -0.180819 0.048720 0.173832 0.382129 0.471262
pr.axis_aspect_ratio 0.091704 0.153362 0.158397 0.663559 1.000000 0.648704 0.103715 -0.183264 0.079395 0.127128 0.273414 0.089620 0.122111 0.152776 -0.058481 -0.032134 0.239849 0.267724
max.length_aspect_ratio 0.148249 0.251208 0.264550 0.450036 0.648704 1.000000 0.165967 -0.180041 0.161592 0.305943 0.318955 0.143713 0.189704 0.295574 0.015439 0.043489 -0.026180 0.143919
scatter_ratio 0.812235 0.848207 0.904400 0.734228 0.103715 0.165967 1.000000 -0.970723 0.989370 0.808356 0.948296 0.993784 0.799266 -0.027985 0.074308 0.213127 0.005167 0.118448
elongatedness -0.788643 -0.821901 -0.911435 -0.789795 -0.183264 -0.180041 -0.970723 1.000000 -0.949077 -0.775519 -0.936715 -0.955074 -0.766029 0.103481 -0.051997 -0.185691 -0.114727 -0.216719
pr.axis_rectangularity 0.813636 0.844972 0.893128 0.708285 0.079395 0.161592 0.989370 -0.949077 1.000000 0.811447 0.934568 0.989490 0.797068 -0.015676 0.082974 0.214734 -0.018990 0.099191
max.length_rectangularity 0.676143 0.961943 0.774669 0.569205 0.127128 0.305943 0.808356 -0.775519 0.811447 1.000000 0.745209 0.796018 0.866425 0.041220 0.135745 0.001658 -0.104254 0.076770
scaled_variance 0.762770 0.796822 0.861980 0.794041 0.273414 0.318955 0.948296 -0.936715 0.934568 0.745209 1.000000 0.947021 0.778975 0.112299 0.036005 0.195260 0.014418 0.086594
scaled_variance.1 0.815901 0.838525 0.887328 0.720150 0.089620 0.143713 0.993784 -0.955074 0.989490 0.796018 0.947021 1.000000 0.796070 -0.016608 0.076974 0.201573 0.006637 0.103762
scaled_radius_of_gyration 0.585156 0.926888 0.705953 0.536536 0.122111 0.189704 0.799266 -0.766029 0.797068 0.866425 0.778975 0.796070 1.000000 0.191440 0.166371 -0.055973 -0.224866 -0.118157
scaled_radius_of_gyration.1 -0.250071 0.052642 -0.225852 -0.180819 0.152776 0.295574 -0.027985 0.103481 -0.015676 0.041220 0.112299 -0.016608 0.191440 1.000000 -0.088304 -0.126417 -0.749509 -0.802608
skewness_about 0.235687 0.144394 0.113813 0.048720 -0.058481 0.015439 0.074308 -0.051997 0.082974 0.135745 0.036005 0.076974 0.166371 -0.088304 1.000000 -0.035023 0.115145 0.096870
skewness_about.1 0.157387 -0.011851 0.265553 0.173832 -0.032134 0.043489 0.213127 -0.185691 0.214734 0.001658 0.195260 0.201573 -0.055973 -0.126417 -0.035023 1.000000 0.077428 0.205090
skewness_about.2 0.298526 -0.105645 0.145563 0.382129 0.239849 -0.026180 0.005167 -0.114727 -0.018990 -0.104254 0.014418 0.006637 -0.224866 -0.749509 0.115145 0.077428 1.000000 0.892840
hollows_ratio 0.365552 0.045318 0.332095 0.471262 0.267724 0.143919 0.118448 -0.216719 0.099191 0.076770 0.086594 0.103762 -0.118157 -0.802608 0.096870 0.205090 0.892840 1.000000
In [20]:
plt.subplots(figsize=(10,8))
sns.heatmap(sh_df.corr())
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x227c8c67d30>

The following variables are strongly positively correlated with one another: scaled_variance, scaled_variance.1, scatter_ratio and pr.axis_rectangularity. elongatedness is strongly negatively correlated with compactness, circularity, distance_circularity, radius_ratio, scatter_ratio, pr.axis_rectangularity, max.length_rectangularity, scaled_variance, scaled_variance.1 and scaled_radius_of_gyration. In addition, scaled_radius_of_gyration.1 is strongly negatively correlated with skewness_about.2 and hollows_ratio.
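The strongly correlated pairs called out above can also be listed programmatically; a sketch on hypothetical toy data, where the 0.9 threshold is an arbitrary cutoff chosen for illustration:

```python
import pandas as pd
import numpy as np

def high_corr_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for every column pair with |r| above threshold."""
    corr = df.corr(numeric_only=True)
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], r))
    return pairs

# Toy data: 'a' and 'b' are perfectly correlated, 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({'a': a, 'b': 2 * a, 'c': rng.normal(size=100)})
found = high_corr_pairs(toy, threshold=0.9)
```

Applied to `sh_df`, the same helper would recover the scatter_ratio / scaled_variance.1 / pr.axis_rectangularity cluster visible in the heatmap.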


1d - Find outliers

In [21]:
from itertools import chain

numeric_cols = sh_df.select_dtypes(include=['float64', 'int64']).columns

outlier_rec = []
# For each predictors find outliers using mathematical function.
def find_outlier(df_in, col_name, verbose=False):
    q25 = df_in[col_name].quantile(0.25)
    q75 = df_in[col_name].quantile(0.75)
    iqr = q75-q25 #Interquartile range
    lower, upper  = q25-1.5*iqr, q75+1.5*iqr
    outliers_df = df_in[(df_in[col_name] < lower) | (df_in[col_name] > upper)]
    outliers_removed_df = df_in[(df_in[col_name] >= lower) & (df_in[col_name] <= upper)]
    
    if(verbose):
        print('# Number of outliers / non-outliers for column \'{}\': {} /{}'.format(
            col_name, outliers_df.shape[0], outliers_removed_df.shape[0]))
    return outliers_df.index.tolist()

for feature in numeric_cols:
    outlier_rec.append(find_outlier(sh_df, feature, True))
    
outlier_rec = list(chain.from_iterable(outlier_rec))
outlier_rec = list(set(outlier_rec)) 
outlier_rec.sort()
print('# Total outliers in the dataset: {}'.format(len(outlier_rec)))
print(outlier_rec)
# Number of outliers / non-outliers for column 'compactness': 0 /846
# Number of outliers / non-outliers for column 'circularity': 0 /846
# Number of outliers / non-outliers for column 'distance_circularity': 0 /846
# Number of outliers / non-outliers for column 'radius_ratio': 3 /843
# Number of outliers / non-outliers for column 'pr.axis_aspect_ratio': 8 /838
# Number of outliers / non-outliers for column 'max.length_aspect_ratio': 13 /833
# Number of outliers / non-outliers for column 'scatter_ratio': 0 /846
# Number of outliers / non-outliers for column 'elongatedness': 0 /846
# Number of outliers / non-outliers for column 'pr.axis_rectangularity': 0 /846
# Number of outliers / non-outliers for column 'max.length_rectangularity': 0 /846
# Number of outliers / non-outliers for column 'scaled_variance': 1 /845
# Number of outliers / non-outliers for column 'scaled_variance.1': 2 /844
# Number of outliers / non-outliers for column 'scaled_radius_of_gyration': 0 /846
# Number of outliers / non-outliers for column 'scaled_radius_of_gyration.1': 15 /831
# Number of outliers / non-outliers for column 'skewness_about': 12 /834
# Number of outliers / non-outliers for column 'skewness_about.1': 1 /845
# Number of outliers / non-outliers for column 'skewness_about.2': 0 /846
# Number of outliers / non-outliers for column 'hollows_ratio': 0 /846
# Total outliers in the dataset: 33
[4, 37, 44, 47, 79, 85, 100, 113, 123, 127, 132, 135, 190, 230, 291, 346, 381, 388, 391, 400, 498, 505, 516, 523, 544, 623, 655, 706, 761, 796, 797, 815, 835]
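The same 1.5×IQR rule used in find_outlier above can be written as a compact vectorized mask; a sketch (not the notebook's helper) on a hypothetical toy series:

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series) -> pd.Series:
    """True where a value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Toy series: 100 sits far outside the bulk of the data
s = pd.Series([10, 11, 12, 13, 14, 100])
mask = iqr_outlier_mask(s)
```

The boolean mask can be used directly for counting (`mask.sum()`) or for selecting rows (`s[mask]`), which avoids building two filtered frames per column.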
In [22]:
plt.figure(figsize=(15,10))
labels = sh_df.columns
sns.set(style='whitegrid')
sbplot = sns.boxplot(data=sh_df)
sbplot.set_xticklabels(labels=labels, rotation=45)
Out[22]:
[Text(0, 0, 'compactness'),
 Text(0, 0, 'circularity'),
 Text(0, 0, 'distance_circularity'),
 Text(0, 0, 'radius_ratio'),
 Text(0, 0, 'pr.axis_aspect_ratio'),
 Text(0, 0, 'max.length_aspect_ratio'),
 Text(0, 0, 'scatter_ratio'),
 Text(0, 0, 'elongatedness'),
 Text(0, 0, 'pr.axis_rectangularity'),
 Text(0, 0, 'max.length_rectangularity'),
 Text(0, 0, 'scaled_variance'),
 Text(0, 0, 'scaled_variance.1'),
 Text(0, 0, 'scaled_radius_of_gyration'),
 Text(0, 0, 'scaled_radius_of_gyration.1'),
 Text(0, 0, 'skewness_about'),
 Text(0, 0, 'skewness_about.1'),
 Text(0, 0, 'skewness_about.2'),
 Text(0, 0, 'hollows_ratio')]
In [23]:
fig, ax = plt.subplots(4,5, figsize=(22,20))

sns.boxplot(sh_df['compactness'], ax=ax[0,0])
sns.boxplot(sh_df['circularity'], ax=ax[0,1])
sns.boxplot(sh_df['distance_circularity'], ax=ax[0,2])
sns.boxplot(sh_df['radius_ratio'], ax=ax[0,3])
sns.boxplot(sh_df['pr.axis_aspect_ratio'], ax=ax[0,4])
sns.boxplot(sh_df['max.length_aspect_ratio'], ax=ax[1,0])
sns.boxplot(sh_df['scatter_ratio'], ax=ax[1,1])
sns.boxplot(sh_df['elongatedness'], ax=ax[1,2])
sns.boxplot(sh_df['pr.axis_rectangularity'], ax=ax[1,3])
sns.boxplot(sh_df['max.length_rectangularity'], ax=ax[1,4])
sns.boxplot(sh_df['scaled_variance'], ax=ax[2,0])
sns.boxplot(sh_df['scaled_radius_of_gyration'], ax=ax[2,1])
sns.boxplot(sh_df['scaled_radius_of_gyration.1'], ax=ax[2,2])
sns.boxplot(sh_df['skewness_about'], ax=ax[2,3])
sns.boxplot(sh_df['skewness_about.1'], ax=ax[2,4])
sns.boxplot(sh_df['skewness_about.2'], ax=ax[3,0])
sns.boxplot(sh_df['hollows_ratio'], ax=ax[3,1])
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x227c83d95f8>

From the box plots above it is clear that the following features have outliers: radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance, scaled_variance.1, scaled_radius_of_gyration.1, skewness_about and skewness_about.1.

1e - Treat outliers

In [95]:
pd.crosstab(sh_df['radius_ratio'], sh_df['class'])
Out[95]:
class bus car van
radius_ratio
104.0 0 1 0
105.0 0 0 1
109.0 0 0 1
110.0 0 1 2
111.0 0 1 3
112.0 0 0 1
113.0 2 0 2
114.0 0 1 3
115.0 0 2 2
116.0 2 3 2
117.0 0 1 3
118.0 1 1 0
119.0 1 2 2
120.0 7 1 1
121.0 2 3 3
122.0 3 2 0
123.0 6 2 2
124.0 1 0 2
125.0 5 2 6
126.0 3 1 4
127.0 1 5 1
128.0 4 2 0
129.0 1 2 2
130.0 5 5 2
131.0 0 2 4
132.0 1 3 1
133.0 1 3 7
134.0 0 1 2
135.0 1 2 2
136.0 1 5 5
... ... ... ...
208.0 0 4 0
209.0 2 9 0
210.0 0 3 0
211.0 0 8 0
212.0 0 5 0
213.0 1 7 0
214.0 0 2 0
215.0 0 5 0
216.0 1 2 0
217.0 0 1 0
218.0 0 2 0
219.0 1 5 0
220.0 0 4 0
221.0 0 4 0
222.0 0 5 0
223.0 1 2 0
224.0 0 2 0
225.0 0 4 0
226.0 1 0 0
227.0 1 1 0
228.0 0 5 0
230.0 0 4 0
231.0 0 5 1
232.0 0 1 0
234.0 0 2 0
235.0 1 0 0
238.0 1 0 0
246.0 1 0 1
250.0 0 0 1
252.0 1 0 3

132 rows × 3 columns

In [96]:
sh_df['radius_ratio'] = np.where(sh_df['radius_ratio']>=300, 252, sh_df['radius_ratio'])
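The np.where capping used here can also be expressed with pandas' clip; a sketch on hypothetical toy values (equivalent for this column because no observed radius_ratio falls between 252 and 300):

```python
import pandas as pd

# Toy values spanning the observed radius_ratio range, including the two extremes
s = pd.Series([104.0, 178.0, 306.0, 333.0])

# Cap everything above 252 at 252, mirroring np.where(s >= 300, 252, s) for this data
capped = s.clip(upper=252)
```

clip is the more idiomatic choice when the intent is "cap at the chosen ceiling", since it needs no separate threshold and replacement value.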
In [97]:
pd.crosstab(sh_df['radius_ratio'], sh_df['class'])
Out[97]:
class bus car van
radius_ratio
104.0 0 1 0
105.0 0 0 1
109.0 0 0 1
110.0 0 1 2
111.0 0 1 3
112.0 0 0 1
113.0 2 0 2
114.0 0 1 3
115.0 0 2 2
116.0 2 3 2
117.0 0 1 3
118.0 1 1 0
119.0 1 2 2
120.0 7 1 1
121.0 2 3 3
122.0 3 2 0
123.0 6 2 2
124.0 1 0 2
125.0 5 2 6
126.0 3 1 4
127.0 1 5 1
128.0 4 2 0
129.0 1 2 2
130.0 5 5 2
131.0 0 2 4
132.0 1 3 1
133.0 1 3 7
134.0 0 1 2
135.0 1 2 2
136.0 1 5 5
... ... ... ...
208.0 0 4 0
209.0 2 9 0
210.0 0 3 0
211.0 0 8 0
212.0 0 5 0
213.0 1 7 0
214.0 0 2 0
215.0 0 5 0
216.0 1 2 0
217.0 0 1 0
218.0 0 2 0
219.0 1 5 0
220.0 0 4 0
221.0 0 4 0
222.0 0 5 0
223.0 1 2 0
224.0 0 2 0
225.0 0 4 0
226.0 1 0 0
227.0 1 1 0
228.0 0 5 0
230.0 0 4 0
231.0 0 5 1
232.0 0 1 0
234.0 0 2 0
235.0 1 0 0
238.0 1 0 0
246.0 1 0 1
250.0 0 0 1
252.0 1 0 3

132 rows × 3 columns

In [98]:
pd.crosstab(sh_df['pr.axis_aspect_ratio'], sh_df['class'])
Out[98]:
class bus car van
pr.axis_aspect_ratio
47.00000 2 0 0
48.00000 1 0 3
49.00000 1 0 2
50.00000 3 1 1
51.00000 7 1 3
52.00000 3 2 9
53.00000 8 13 6
54.00000 12 16 10
55.00000 7 19 11
56.00000 13 28 16
57.00000 7 29 8
58.00000 4 22 17
59.00000 6 43 15
60.00000 5 33 8
61.00000 8 26 8
61.67891 1 1 0
62.00000 6 42 10
63.00000 4 28 13
64.00000 17 30 22
65.00000 13 20 5
66.00000 10 17 10
67.00000 7 16 5
68.00000 12 15 7
69.00000 13 10 2
70.00000 10 7 1
71.00000 10 5 0
72.00000 7 2 1
73.00000 6 1 0
74.00000 7 2 0
75.00000 8 0 6
In [99]:
sh_df['pr.axis_aspect_ratio'] = np.where(sh_df['pr.axis_aspect_ratio']>=76, 75, sh_df['pr.axis_aspect_ratio'])
In [100]:
pd.crosstab(sh_df['pr.axis_aspect_ratio'], sh_df['class'])
Out[100]:
class bus car van
pr.axis_aspect_ratio
47.00000 2 0 0
48.00000 1 0 3
49.00000 1 0 2
50.00000 3 1 1
51.00000 7 1 3
52.00000 3 2 9
53.00000 8 13 6
54.00000 12 16 10
55.00000 7 19 11
56.00000 13 28 16
57.00000 7 29 8
58.00000 4 22 17
59.00000 6 43 15
60.00000 5 33 8
61.00000 8 26 8
61.67891 1 1 0
62.00000 6 42 10
63.00000 4 28 13
64.00000 17 30 22
65.00000 13 20 5
66.00000 10 17 10
67.00000 7 16 5
68.00000 12 15 7
69.00000 13 10 2
70.00000 10 7 1
71.00000 10 5 0
72.00000 7 2 1
73.00000 6 1 0
74.00000 7 2 0
75.00000 8 0 6
In [101]:
pd.crosstab(sh_df['max.length_aspect_ratio'], sh_df['class'])
Out[101]:
class bus car van
max.length_aspect_ratio
2 0 0 1
3 0 2 2
4 7 9 2
5 26 15 10
6 78 35 19
7 80 62 26
8 22 58 33
9 0 60 34
10 0 80 32
11 0 78 30
12 5 30 10
In [102]:
sh_df['max.length_aspect_ratio'] = np.where(sh_df['max.length_aspect_ratio']>=13, 12, sh_df['max.length_aspect_ratio'])
In [103]:
pd.crosstab(sh_df['max.length_aspect_ratio'], sh_df['class'])
Out[103]:
class bus car van
max.length_aspect_ratio
2 0 0 1
3 0 2 2
4 7 9 2
5 26 15 10
6 78 35 19
7 80 62 26
8 22 58 33
9 0 60 34
10 0 80 32
11 0 78 30
12 5 30 10
In [104]:
pd.crosstab(sh_df['scaled_variance'], sh_df['class'])
Out[104]:
class bus car van
scaled_variance
130.0 0 1 0
131.0 0 1 0
132.0 0 1 0
134.0 0 0 1
135.0 0 3 3
136.0 0 1 1
137.0 0 3 3
138.0 0 0 3
139.0 0 1 3
140.0 0 0 4
141.0 0 2 3
142.0 0 1 3
143.0 0 1 2
144.0 0 2 1
145.0 0 2 2
146.0 0 2 1
147.0 0 3 3
148.0 0 4 4
149.0 0 1 2
150.0 0 1 2
151.0 0 5 0
152.0 0 5 0
153.0 0 1 2
154.0 0 3 4
155.0 0 4 2
156.0 0 0 5
157.0 0 0 5
158.0 0 4 4
159.0 0 2 8
160.0 0 1 6
... ... ... ...
230.0 0 1 0
231.0 1 8 1
232.0 2 9 0
234.0 1 4 0
235.0 0 2 0
236.0 0 1 0
237.0 1 1 0
238.0 1 2 0
240.0 1 0 0
241.0 1 1 0
243.0 1 0 0
246.0 1 0 0
247.0 1 0 0
254.0 2 0 0
256.0 1 0 0
258.0 1 0 0
262.0 1 0 0
263.0 1 0 0
264.0 1 0 0
265.0 1 0 0
266.0 1 0 0
267.0 2 0 0
269.0 2 0 0
272.0 1 0 1
275.0 2 0 0
278.0 1 0 0
280.0 3 0 0
285.0 3 0 0
287.0 1 0 0
288.0 1 0 1

128 rows × 3 columns

In [105]:
sh_df['scaled_variance'] = np.where(sh_df['scaled_variance']>=300, 288, sh_df['scaled_variance'])
In [106]:
pd.crosstab(sh_df['scaled_variance'], sh_df['class'])
Out[106]:
class bus car van
scaled_variance
130.0 0 1 0
131.0 0 1 0
132.0 0 1 0
134.0 0 0 1
135.0 0 3 3
136.0 0 1 1
137.0 0 3 3
138.0 0 0 3
139.0 0 1 3
140.0 0 0 4
141.0 0 2 3
142.0 0 1 3
143.0 0 1 2
144.0 0 2 1
145.0 0 2 2
146.0 0 2 1
147.0 0 3 3
148.0 0 4 4
149.0 0 1 2
150.0 0 1 2
151.0 0 5 0
152.0 0 5 0
153.0 0 1 2
154.0 0 3 4
155.0 0 4 2
156.0 0 0 5
157.0 0 0 5
158.0 0 4 4
159.0 0 2 8
160.0 0 1 6
... ... ... ...
230.0 0 1 0
231.0 1 8 1
232.0 2 9 0
234.0 1 4 0
235.0 0 2 0
236.0 0 1 0
237.0 1 1 0
238.0 1 2 0
240.0 1 0 0
241.0 1 1 0
243.0 1 0 0
246.0 1 0 0
247.0 1 0 0
254.0 2 0 0
256.0 1 0 0
258.0 1 0 0
262.0 1 0 0
263.0 1 0 0
264.0 1 0 0
265.0 1 0 0
266.0 1 0 0
267.0 2 0 0
269.0 2 0 0
272.0 1 0 1
275.0 2 0 0
278.0 1 0 0
280.0 3 0 0
285.0 3 0 0
287.0 1 0 0
288.0 1 0 1

128 rows × 3 columns

In [107]:
pd.crosstab(sh_df['skewness_about'], sh_df['class'])
Out[107]:
class bus car van
skewness_about
0.000000 19 44 14
1.000000 22 38 21
2.000000 19 29 15
3.000000 17 26 13
4.000000 28 25 17
5.000000 22 35 14
6.000000 25 27 13
6.364286 4 2 0
7.000000 18 22 20
8.000000 15 19 13
9.000000 14 22 10
10.000000 6 21 9
11.000000 4 20 7
12.000000 2 18 10
13.000000 2 18 6
14.000000 0 14 4
15.000000 0 15 4
16.000000 0 10 1
17.000000 1 5 5
18.000000 0 4 2
19.000000 0 15 1
In [108]:
sh_df['skewness_about'] = np.where(sh_df['skewness_about']>=20, 19, sh_df['skewness_about'])
In [109]:
pd.crosstab(sh_df['skewness_about'], sh_df['class'])
Out[109]:
class bus car van
skewness_about
0.000000 19 44 14
1.000000 22 38 21
2.000000 19 29 15
3.000000 17 26 13
4.000000 28 25 17
5.000000 22 35 14
6.000000 25 27 13
6.364286 4 2 0
7.000000 18 22 20
8.000000 15 19 13
9.000000 14 22 10
10.000000 6 21 9
11.000000 4 20 7
12.000000 2 18 10
13.000000 2 18 6
14.000000 0 14 4
15.000000 0 15 4
16.000000 0 10 1
17.000000 1 5 5
18.000000 0 4 2
19.000000 0 15 1
In [110]:
pd.crosstab(sh_df['scaled_radius_of_gyration'], sh_df['class'])
Out[110]:
class bus car van
scaled_radius_of_gyration
109.0 0 0 1
112.0 0 0 3
113.0 0 0 1
114.0 0 1 0
115.0 0 1 1
116.0 0 0 2
117.0 1 2 0
118.0 0 1 1
119.0 1 0 3
120.0 0 2 0
121.0 0 4 3
123.0 0 5 3
124.0 1 1 4
125.0 1 3 2
126.0 0 2 1
127.0 3 4 5
128.0 0 4 2
129.0 1 2 1
130.0 0 4 1
131.0 0 2 1
132.0 1 4 1
133.0 1 3 1
134.0 0 2 3
135.0 2 2 1
136.0 1 4 1
137.0 2 4 4
138.0 1 5 2
139.0 2 8 3
140.0 0 3 4
141.0 2 3 1
... ... ... ...
199.0 1 4 1
200.0 1 6 2
201.0 2 7 0
202.0 0 6 0
203.0 0 5 1
204.0 0 7 0
205.0 1 3 0
206.0 2 2 0
207.0 0 2 0
208.0 1 0 0
209.0 2 2 0
210.0 1 5 0
211.0 0 1 0
212.0 0 7 0
213.0 1 6 0
214.0 2 13 0
215.0 0 3 0
216.0 0 10 0
217.0 1 6 0
218.0 1 12 0
219.0 1 5 0
220.0 0 5 0
221.0 1 5 0
222.0 1 5 0
223.0 0 7 0
224.0 0 4 0
226.0 0 2 0
228.0 1 1 0
229.0 1 1 0
240.0 22 19 0

118 rows × 3 columns

In [115]:
sh_df['scaled_radius_of_gyration'] = np.where(sh_df['scaled_radius_of_gyration']>=200, 210, sh_df['scaled_radius_of_gyration'])
In [117]:
pd.crosstab(sh_df['scaled_radius_of_gyration'], sh_df['class'])
Out[117]:
class bus car van
scaled_radius_of_gyration
109.000000 0 0 1
112.000000 0 0 3
113.000000 0 0 1
114.000000 0 1 0
115.000000 0 1 1
116.000000 0 0 2
117.000000 1 2 0
118.000000 0 1 1
119.000000 1 0 3
120.000000 0 2 0
121.000000 0 4 3
123.000000 0 5 3
124.000000 1 1 4
125.000000 1 3 2
126.000000 0 2 1
127.000000 3 4 5
128.000000 0 4 2
129.000000 1 2 1
130.000000 0 4 1
131.000000 0 2 1
132.000000 1 4 1
133.000000 1 3 1
134.000000 0 2 3
135.000000 2 2 1
136.000000 1 4 1
137.000000 2 4 4
138.000000 1 5 2
139.000000 2 8 3
140.000000 0 3 4
141.000000 2 3 1
... ... ... ...
172.000000 9 2 3
173.000000 10 3 2
174.000000 4 7 5
174.709716 2 0 0
175.000000 1 2 2
176.000000 8 2 9
177.000000 7 3 3
178.000000 4 6 2
179.000000 3 1 4
180.000000 3 3 0
181.000000 0 1 1
182.000000 2 1 3
183.000000 5 2 0
184.000000 3 4 3
185.000000 6 2 7
186.000000 9 5 10
187.000000 1 4 4
188.000000 3 3 3
189.000000 3 3 1
190.000000 3 3 0
191.000000 3 2 0
192.000000 2 1 1
193.000000 0 4 0
194.000000 1 3 0
195.000000 1 5 0
196.000000 0 3 0
197.000000 0 5 0
198.000000 5 4 0
199.000000 1 4 1
210.000000 42 157 3

90 rows × 3 columns

In [113]:
# After Treating the outliers
plt.figure(figsize=(15,10))
labels = sh_df.columns
sns.set(style='whitegrid')
sbplot = sns.boxplot(data=sh_df)
sbplot.set_xticklabels(labels=labels, rotation=45)
Out[113]:
[Text(0, 0, 'compactness'),
 Text(0, 0, 'circularity'),
 Text(0, 0, 'distance_circularity'),
 Text(0, 0, 'radius_ratio'),
 Text(0, 0, 'pr.axis_aspect_ratio'),
 Text(0, 0, 'max.length_aspect_ratio'),
 Text(0, 0, 'scatter_ratio'),
 Text(0, 0, 'elongatedness'),
 Text(0, 0, 'pr.axis_rectangularity'),
 Text(0, 0, 'max.length_rectangularity'),
 Text(0, 0, 'scaled_variance'),
 Text(0, 0, 'scaled_variance.1'),
 Text(0, 0, 'scaled_radius_of_gyration'),
 Text(0, 0, 'scaled_radius_of_gyration.1'),
 Text(0, 0, 'skewness_about'),
 Text(0, 0, 'skewness_about.1'),
 Text(0, 0, 'skewness_about.2'),
 Text(0, 0, 'hollows_ratio')]
In [114]:
print('# Shape of dataset before removing outliers: {}'.format(sh_df.shape[0]))

## Treating outliers. Alternative action - drop the rows instead of capping

# sh_df.drop(sh_df.index[outlier_rec], inplace=True)

print('# Shape of dataset after removing outliers: {}'.format(sh_df.shape[0]))
# Shape of dataset before removing outliers: 846
# Shape of dataset after removing outliers: 846

2. Understanding the attributes - Find relationship between different attributes (Independent variables) and choose carefully which all attributes have to be a part of the analysis and why (2.5 points)

Two approaches to find relationships between different attributes (independent variables)

1) Using a correlation matrix, visualized as a heatmap

2) Using the variance inflation factor (VIF)

In [42]:
# Create correlation matrix
corr_matrix = sh_df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
to_drop
Out[42]:
['elongatedness',
 'pr.axis_rectangularity',
 'max.length_rectangularity',
 'scaled_variance',
 'scaled_variance.1']
In [43]:
plt.figure(figsize=(20,12))
sns.heatmap(corr_matrix, annot=True)
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x227ca1bc6d8>
In [44]:
# Standardize the dataset
from sklearn.preprocessing import StandardScaler

# Standardize the feature matrix
feature_space = pd.DataFrame(sh_df, columns=numeric_cols)
std_sh = StandardScaler().fit_transform(feature_space)

std_sh_df = pd.DataFrame(std_sh, columns=numeric_cols)
In [45]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictor_variables = numeric_cols

threshold = 10
for i in np.arange(0, len(predictor_variables)):
    vif = [variance_inflation_factor(std_sh_df[predictor_variables].values, j) for j in range(std_sh_df[predictor_variables].shape[1])]
    maxindex = vif.index(max(vif))
    if max(vif) > threshold:
        #print ("VIF :", vif)
        # NOTE: the deletion below is commented out, so the same column is
        # flagged on every iteration instead of being dropped and re-checked.
        print('Eliminating \'' + std_sh_df[predictor_variables].columns[maxindex] + '\' at index: ' + str(maxindex))
        #del predictor_variables[maxindex]
    else:
        break
Eliminating 'scatter_ratio' at index: 6
(the line above repeats once per iteration, 18 times in total, because the flagged column is never actually removed)
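A corrected version of the elimination loop would drop the worst column each round before recomputing. As a self-contained sketch (helper names are illustrative), it uses the fact that for standardized predictors the VIFs are the diagonal of the inverse correlation matrix, so only NumPy is needed:

```python
import numpy as np

def vif_from_matrix(X: np.ndarray) -> np.ndarray:
    """VIFs of the columns of X: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def eliminate_high_vif(X: np.ndarray, names, threshold: float = 10.0):
    """Iteratively drop the highest-VIF column until all VIFs <= threshold."""
    names = list(names)
    while X.shape[1] > 1:
        vifs = vif_from_matrix(X)
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        print(f"Eliminating '{names[worst]}' (VIF={vifs[worst]:.1f})")
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names
```

Because the column is actually removed, each iteration flags a different feature and the loop terminates once every remaining VIF is below the threshold.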

3. Use PCA from scikit learn and elbow plot to find out reduced number of dimension (which covers more than 95% of the variance) - 10 points

In [46]:
C_df2 = sh_df.drop(columns=['class'])
In [47]:
from sklearn.decomposition import PCA

# Fit a full PCA first, to inspect how much variance each component explains
pca_model = PCA()

X_test_reduced = pca_model.fit_transform(std_sh_df)
X_test_reduced.shape
Out[47]:
(846, 18)
In [48]:
pca_model.components_
Out[48]:
array([[ 2.72309648e-01,  2.88505662e-01,  3.02346567e-01,
         2.67766395e-01,  9.51653369e-02,  1.89858448e-01,
         3.10902027e-01, -3.09089186e-01,  3.07935629e-01,
         2.79231130e-01,  3.00591300e-01,  3.07653564e-01,
         2.64897583e-01, -3.42140296e-02,  4.11122110e-02,
         5.85970714e-02,  3.51986558e-02,  8.18318796e-02],
       [-1.00200443e-01,  1.31829895e-01, -5.49412498e-02,
        -1.91532122e-01, -2.39736179e-01, -1.01388477e-01,
         6.74283225e-02, -4.36581824e-03,  7.98800073e-02,
         1.20932751e-01,  8.10074663e-02,  7.40971656e-02,
         2.13766469e-01,  4.74200760e-01, -4.02670879e-02,
        -1.05917349e-01, -5.19178420e-01, -5.23977284e-01],
       [-1.73258843e-01, -6.82895527e-02, -5.26126409e-02,
         2.89902042e-01,  5.62581670e-01, -1.53720250e-01,
         1.53299645e-02, -7.53480500e-02, -8.97628273e-03,
        -1.20943347e-01,  1.27294401e-01,  1.21938927e-02,
        -4.55379105e-02,  2.22558876e-01, -6.67123300e-01,
         7.44022560e-03, -1.92913114e-02, -6.57164562e-02],
       [ 3.99941118e-02, -1.90540791e-01,  1.05522995e-01,
        -9.80487317e-02, -3.80557127e-01, -8.00197451e-02,
         1.17680050e-01, -6.69273892e-02,  1.26860930e-01,
        -1.75051235e-01,  9.38119876e-02,  1.19830184e-01,
        -2.04891001e-01, -5.10311932e-02, -2.75314825e-01,
         7.59698692e-01, -6.47927389e-02, -1.31039418e-02],
       [ 1.64652476e-01, -1.36001045e-01, -1.01717057e-01,
         1.22420446e-01,  8.48258659e-02, -7.30299195e-01,
         1.02722554e-01, -9.51967899e-02,  9.66030687e-02,
        -2.66896638e-01,  1.68980324e-01,  1.45457012e-01,
         6.30500843e-03,  1.02787416e-01,  4.07225124e-01,
        -2.07917255e-02,  2.13769624e-01, -1.26760966e-01],
       [-1.26916099e-01, -2.74015681e-02,  6.25023465e-03,
         1.79840238e-01,  4.27093751e-01,  2.70478864e-01,
        -8.90517717e-02,  8.86111028e-02, -9.22648277e-02,
        -1.63797533e-02, -1.75834215e-02, -1.15146596e-01,
        -8.34551832e-03,  3.13732846e-01,  5.17272023e-01,
         5.08858095e-01, -1.75849642e-01, -3.36583971e-02],
       [ 3.51622132e-01, -3.58664762e-01,  7.13536086e-02,
         1.43047218e-01, -1.35480411e-01,  3.92959311e-01,
         4.36883227e-02, -3.53841468e-02,  4.80014092e-02,
        -2.71277119e-01,  1.95522144e-01,  4.32861158e-02,
        -4.43073739e-01,  3.58729276e-01,  3.81240071e-02,
        -3.20794900e-01, -1.13358686e-02, -2.72874033e-02],
       [ 3.66345568e-01,  1.87802088e-01, -2.98295562e-01,
        -5.68043190e-02, -1.34932190e-01, -6.17201560e-02,
        -1.39820902e-01,  2.04824819e-01, -1.17281840e-01,
         3.11325217e-01,  6.02799818e-02, -1.00109086e-01,
         9.84139827e-02,  5.14294038e-01, -1.53879539e-01,
         1.69579385e-01,  3.87745500e-01,  2.31289676e-01],
       [-6.90576868e-01, -4.45586241e-02,  1.44923260e-01,
         5.16484679e-02, -2.96452640e-01,  5.69447708e-02,
         1.05264630e-02, -1.78500988e-01, -5.46677111e-02,
        -9.13142537e-02,  3.04128398e-01, -1.34970951e-02,
         1.39635360e-01,  2.85348684e-01,  7.81761129e-02,
        -8.00715630e-02,  3.03168953e-01,  2.54927705e-01],
       [ 2.71763214e-01, -7.64070642e-02,  3.54208055e-01,
         1.66171305e-01, -1.10299404e-01,  1.20597680e-02,
        -2.17755712e-01,  1.78452108e-01, -2.53419676e-01,
        -4.13588849e-01,  1.21375016e-01, -2.23499888e-01,
         5.97877740e-01, -4.69864084e-02, -9.47273047e-02,
         1.81657522e-02, -8.11320855e-02, -1.19923419e-02],
       [-2.29389931e-03, -1.55498578e-01, -7.17457254e-01,
        -3.63161654e-02,  5.84352282e-05,  2.94179021e-01,
         1.53529567e-01, -1.04116707e-01,  1.78413751e-01,
        -3.13059406e-01,  1.24903118e-03,  1.61454292e-01,
         4.06661612e-01, -9.41206581e-02, -4.37209588e-03,
         2.81781863e-02, -1.97076338e-02,  8.63576862e-02],
       [-1.20162557e-01,  2.27904903e-01, -2.57117356e-01,
         7.18757281e-01, -3.16372794e-01, -2.88664456e-02,
        -5.16361790e-02,  3.68038016e-01,  1.14028113e-02,
         5.67189902e-02,  1.45938417e-01,  8.55608973e-03,
        -1.63688331e-01, -1.89802153e-01,  1.56952916e-02,
        -7.83707908e-04, -1.24823241e-01, -9.21457421e-02],
       [-5.15751356e-02, -9.95782218e-02,  1.45425756e-01,
        -4.01905419e-02,  4.15457583e-02, -1.71667394e-01,
         7.06523272e-02,  4.52725886e-01,  3.30437502e-01,
        -4.52121844e-02, -1.60510051e-01,  2.47961170e-01,
         5.20077926e-02,  1.70762505e-01,  9.48238654e-04,
        -9.48513275e-02, -3.53655990e-01,  5.99246068e-01],
       [-1.15453388e-01, -2.56471431e-01,  1.19718169e-01,
        -1.62441077e-01,  1.30096197e-01,  1.57786381e-01,
        -1.36086712e-02,  5.73762853e-01,  2.40840804e-01,
         1.25481315e-01,  1.84464569e-01,  2.35995831e-01,
         1.13891244e-01, -9.81005673e-02, -1.41293644e-02,
         1.33707136e-02,  3.99102976e-01, -4.04023770e-01],
       [ 3.31498599e-02, -6.30440995e-01, -1.08894167e-01,
         5.79838750e-02, -3.27319810e-03, -1.14295279e-01,
        -7.32184864e-02, -9.99223104e-02, -1.42571409e-01,
         5.39349174e-01,  3.36825749e-01, -1.34197314e-01,
         1.12984933e-01, -1.52023868e-01,  2.46097540e-02,
        -1.04543801e-02, -2.47959760e-01,  1.30557112e-01],
       [-2.77870864e-02, -3.33700397e-01,  1.08451026e-01,
         3.77272254e-01, -1.62137615e-01, -6.56619101e-03,
         3.65675785e-02, -1.59711558e-01,  2.29853933e-01,
         1.69146598e-01, -6.86113480e-01, -7.62671140e-02,
         1.87555633e-01,  1.66771240e-01, -3.34615910e-02,
         1.03714298e-02,  1.89859788e-01, -1.49411264e-01],
       [-6.79536745e-03, -8.25094029e-02,  2.70392515e-02,
         6.12766330e-02, -2.27758209e-02,  1.99678934e-02,
         3.81802344e-01,  9.65880773e-02, -7.13576830e-01,
         4.56630616e-02, -1.81909282e-01,  5.34233525e-01,
         3.47353288e-02,  3.77175924e-02, -4.43087857e-03,
         3.08164983e-03,  1.32534324e-02, -5.73907002e-03],
       [-4.60320084e-03, -1.39829575e-03,  7.46282899e-03,
        -3.57252049e-02,  1.92644909e-02, -1.68745847e-02,
         7.83169817e-01,  2.12561839e-01, -5.47927981e-04,
        -1.08652997e-02,  4.17313376e-02, -5.79960379e-01,
         4.63161524e-03,  1.60550622e-03, -2.42572019e-03,
        -1.33065447e-02,  3.19953674e-02,  2.63045667e-03]])
In [49]:
e_variance = pca_model.explained_variance_
e_variance
Out[49]:
array([9.72092060e+00, 3.16106957e+00, 1.19653230e+00, 1.19311939e+00,
       8.66641182e-01, 7.64188080e-01, 3.54941462e-01, 2.50809515e-01,
       1.87485195e-01, 9.22605349e-02, 6.39918602e-02, 4.60864294e-02,
       4.37757512e-02, 2.96788700e-02, 1.96990431e-02, 1.78170651e-02,
       9.23611228e-03, 3.04880698e-03])
In [50]:
e_variance_ratio = pca_model.explained_variance_ratio_
e_variance_ratio
Out[50]:
array([5.39412786e-01, 1.75407394e-01, 6.63954420e-02, 6.62060604e-02,
       4.80898213e-02, 4.24047103e-02, 1.96956616e-02, 1.39173917e-02,
       1.04035323e-02, 5.11952666e-03, 3.55090110e-03, 2.55733076e-03,
       2.42911149e-03, 1.64687715e-03, 1.09309768e-03, 9.88666928e-04,
       5.12510827e-04, 1.69177955e-04])
In [51]:
np.cumsum(e_variance_ratio)
Out[51]:
array([0.53941279, 0.71482018, 0.78121562, 0.84742168, 0.8955115 ,
       0.93791621, 0.95761188, 0.97152927, 0.9819328 , 0.98705233,
       0.99060323, 0.99316056, 0.99558967, 0.99723655, 0.99832964,
       0.99931831, 0.99983082, 1.        ])
In [52]:
#Plotting the Cumulative Summation of the Explained Variance
plt.figure()
plt.plot(np.cumsum(pca_model.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)') #for each component
plt.title('Vehicle Dataset Explained Variance')
plt.show()

Based on the cumulative sum above, the first 7 principal components cover more than 95% of the variance (cumulative ratio ≈ 0.9576 at component 7)
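The component count can also be read off programmatically rather than by eye. A minimal sketch (the helper name is illustrative), using explained-variance ratios like the ones printed above:

```python
import numpy as np

def n_components_for(ratios, target=0.95):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    return int(np.searchsorted(np.cumsum(ratios), target) + 1)

# First eight ratios from the fitted PCA, rounded
ratios = [0.539, 0.175, 0.066, 0.066, 0.048, 0.042, 0.020, 0.014]
print(n_components_for(ratios))  # -> 7
```

Alternatively, scikit-learn does this directly: passing a float to the constructor, as in `PCA(n_components=0.95)`, selects the smallest number of components whose cumulative explained variance exceeds that fraction.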

Correlation between components and features

In [53]:
pca_df = pd.DataFrame(data=pca_model.components_, columns=C_df2.columns)
In [54]:
pca_df.head()
Out[54]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 0.272310 0.288506 0.302347 0.267766 0.095165 0.189858 0.310902 -0.309089 0.307936 0.279231 0.300591 0.307654 0.264898 -0.034214 0.041112 0.058597 0.035199 0.081832
1 -0.100200 0.131830 -0.054941 -0.191532 -0.239736 -0.101388 0.067428 -0.004366 0.079880 0.120933 0.081007 0.074097 0.213766 0.474201 -0.040267 -0.105917 -0.519178 -0.523977
2 -0.173259 -0.068290 -0.052613 0.289902 0.562582 -0.153720 0.015330 -0.075348 -0.008976 -0.120943 0.127294 0.012194 -0.045538 0.222559 -0.667123 0.007440 -0.019291 -0.065716
3 0.039994 -0.190541 0.105523 -0.098049 -0.380557 -0.080020 0.117680 -0.066927 0.126861 -0.175051 0.093812 0.119830 -0.204891 -0.051031 -0.275315 0.759699 -0.064793 -0.013104
4 0.164652 -0.136001 -0.101717 0.122420 0.084826 -0.730299 0.102723 -0.095197 0.096603 -0.266897 0.168980 0.145457 0.006305 0.102787 0.407225 -0.020792 0.213770 -0.126761
In [55]:
pca_df.corr()
Out[55]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
compactness 1.000000 0.007843 0.000885 -0.008950 0.001978 0.000317 -0.007196 -0.004816 -0.001192 0.000342 -0.005161 -0.002923 -0.006440 -0.009993 0.000745 -0.004250 0.000194 0.000299
circularity 0.007843 1.000000 -0.019410 0.196308 -0.043385 -0.006960 0.157830 0.105622 0.026146 -0.007503 0.113201 0.064105 0.141262 0.219176 -0.016336 0.093222 -0.004266 -0.006548
distance_circularity 0.000885 -0.019410 1.000000 0.022150 -0.004895 -0.000785 0.017808 0.011918 0.002950 -0.000847 0.012773 0.007233 0.015939 0.024730 -0.001843 0.010519 -0.000481 -0.000739
radius_ratio -0.008950 0.196308 0.022150 1.000000 0.049509 0.007942 -0.180108 -0.120530 -0.029837 0.008562 -0.129179 -0.073153 -0.161201 -0.250113 0.018641 -0.106380 0.004868 0.007473
pr.axis_aspect_ratio 0.001978 -0.043385 -0.004895 0.049509 1.000000 -0.001755 0.039805 0.026638 0.006594 -0.001892 0.028549 0.016167 0.035626 0.055276 -0.004120 0.023511 -0.001076 -0.001652
max.length_aspect_ratio 0.000317 -0.006960 -0.000785 0.007942 -0.001755 1.000000 0.006386 0.004273 0.001058 -0.000304 0.004580 0.002594 0.005715 0.008867 -0.000661 0.003772 -0.000173 -0.000265
scatter_ratio -0.007196 0.157830 0.017808 -0.180108 0.039805 0.006386 1.000000 -0.096905 -0.023989 0.006884 -0.103859 -0.058815 -0.129604 -0.201089 0.014987 -0.085529 0.003914 0.006008
elongatedness -0.004816 0.105622 0.011918 -0.120530 0.026638 0.004273 -0.096905 1.000000 -0.016053 0.004607 -0.069503 -0.039359 -0.086733 -0.134571 0.010030 -0.057237 0.002619 0.004021
pr.axis_rectangularity -0.001192 0.026146 0.002950 -0.029837 0.006594 0.001058 -0.023989 -0.016053 1.000000 0.001140 -0.017205 -0.009743 -0.021470 -0.033312 0.002483 -0.014169 0.000648 0.000995
max.length_rectangularity 0.000342 -0.007503 -0.000847 0.008562 -0.001892 -0.000304 0.006884 0.004607 0.001140 1.000000 0.004937 0.002796 0.006161 0.009560 -0.000713 0.004066 -0.000186 -0.000286
scaled_variance -0.005161 0.113201 0.012773 -0.129179 0.028549 0.004580 -0.103859 -0.069503 -0.017205 0.004937 1.000000 -0.042184 -0.092956 -0.144227 0.010749 -0.061344 0.002807 0.004309
scaled_variance.1 -0.002923 0.064105 0.007233 -0.073153 0.016167 0.002594 -0.058815 -0.039359 -0.009743 0.002796 -0.042184 1.000000 -0.052640 -0.081675 0.006087 -0.034739 0.001590 0.002440
scaled_radius_of_gyration -0.006440 0.141262 0.015939 -0.161201 0.035626 0.005715 -0.129604 -0.086733 -0.021470 0.006161 -0.092956 -0.052640 1.000000 -0.179979 0.013414 -0.076550 0.003503 0.005377
scaled_radius_of_gyration.1 -0.009993 0.219176 0.024730 -0.250113 0.055276 0.008867 -0.201089 -0.134571 -0.033312 0.009560 -0.144227 -0.081675 -0.179979 1.000000 0.020813 -0.118773 0.005435 0.008343
skewness_about 0.000745 -0.016336 -0.001843 0.018641 -0.004120 -0.000661 0.014987 0.010030 0.002483 -0.000713 0.010749 0.006087 0.013414 0.020813 1.000000 0.008852 -0.000405 -0.000622
skewness_about.1 -0.004250 0.093222 0.010519 -0.106380 0.023511 0.003772 -0.085529 -0.057237 -0.014169 0.004066 -0.061344 -0.034739 -0.076550 -0.118773 0.008852 1.000000 0.002312 0.003549
skewness_about.2 0.000194 -0.004266 -0.000481 0.004868 -0.001076 -0.000173 0.003914 0.002619 0.000648 -0.000186 0.002807 0.001590 0.003503 0.005435 -0.000405 0.002312 1.000000 -0.000162
hollows_ratio 0.000299 -0.006548 -0.000739 0.007473 -0.001652 -0.000265 0.006008 0.004021 0.000995 -0.000286 0.004309 0.002440 0.005377 0.008343 -0.000622 0.003549 -0.000162 1.000000
In [56]:
sns.heatmap(pca_df)
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x227cc2d2a90>
In [57]:
range_ = list(range(1,X_test_reduced.shape[1]+1))

plt.plot(range_, e_variance)
Out[57]:
[<matplotlib.lines.Line2D at 0x227cc3c1668>]
In [58]:
cum_var_exp = np.cumsum(e_variance_ratio)

e_variance_ratio, cum_var_exp
Out[58]:
(array([5.39412786e-01, 1.75407394e-01, 6.63954420e-02, 6.62060604e-02,
        4.80898213e-02, 4.24047103e-02, 1.96956616e-02, 1.39173917e-02,
        1.04035323e-02, 5.11952666e-03, 3.55090110e-03, 2.55733076e-03,
        2.42911149e-03, 1.64687715e-03, 1.09309768e-03, 9.88666928e-04,
        5.12510827e-04, 1.69177955e-04]),
 array([0.53941279, 0.71482018, 0.78121562, 0.84742168, 0.8955115 ,
        0.93791621, 0.95761188, 0.97152927, 0.9819328 , 0.98705233,
        0.99060323, 0.99316056, 0.99558967, 0.99723655, 0.99832964,
        0.99931831, 0.99983082, 1.        ]))
In [59]:
with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(8, 6))

    plt.bar(range(len(e_variance_ratio)), e_variance_ratio, alpha=0.5, align='center',
            label='individual explained variance')
    plt.step(range(len(e_variance_ratio)), cum_var_exp, where='mid',
             label='cumulative explained variance')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()
In [60]:
# Repeating the above steps by removing the collinear columns
std_sh_drop_df = std_sh_df.drop(to_drop, axis=1)

print(std_sh_drop_df.shape)
std_sh_drop_df.head()
(846, 13)
Out[60]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 0.160580 0.517302 0.056545 0.287723 1.858171 0.867276 -0.208038 0.285618 -0.327938 -0.069503 0.380665 -0.312193 0.183957
1 -0.325470 -0.624564 0.120112 -0.850348 -0.743740 0.394826 -0.599893 -0.513719 -0.059987 0.553405 0.156589 0.013088 0.452977
2 1.254193 0.843549 1.518571 1.241242 0.817407 0.867276 1.148382 1.392391 0.073989 1.591584 -0.403603 -0.149552 0.049447
3 -0.082445 -0.624564 -0.007021 -0.296691 0.297024 0.394826 -0.750606 -1.466773 -1.265769 -0.069503 -0.291565 1.639494 1.529056
4 -1.054545 -0.135193 -0.769817 1.118208 2.378554 1.812177 -0.599893 0.408593 7.308682 0.553405 -0.179527 -1.450677 -1.699181
In [68]:
pca_model = PCA()  # fit all components on the reduced (13-column) feature set

X_test_reduced = pca_model.fit_transform(std_sh_drop_df)
In [69]:
pca_model.components_
Out[69]:
array([[-0.36670069, -0.35601942, -0.3988937 , -0.37919939, -0.1756613 ,
        -0.2822505 , -0.37914701, -0.31135534,  0.12445068, -0.0749812 ,
        -0.08621168, -0.12925528, -0.19850955],
       [ 0.0089776 ,  0.25702313,  0.06979747, -0.07938911, -0.18880243,
        -0.00755828,  0.1886819 ,  0.33174512,  0.46906324, -0.00932412,
        -0.08971561, -0.51703103, -0.4958304 ],
       [-0.17665163,  0.02617315, -0.10632542,  0.29248202,  0.6750045 ,
        -0.10187854, -0.05120462,  0.04936921,  0.20475261, -0.46709319,
        -0.37094997,  0.0207073 , -0.04534806],
       [-0.06661136, -0.11019921,  0.1217731 ,  0.0544463 , -0.04543913,
         0.07819306,  0.10811583, -0.15180351,  0.05787619, -0.62893087,
         0.71032007, -0.13824315, -0.01201274],
       [ 0.00664759, -0.12530335, -0.07555822,  0.22140082,  0.38827546,
        -0.38250739,  0.01776722, -0.02069846,  0.2955731 ,  0.53721443,
         0.4925126 ,  0.02993799, -0.11084193],
       [-0.29478944, -0.01468434, -0.01557207, -0.02822778,  0.24700665,
         0.77976477, -0.25732478, -0.10044065,  0.13050103,  0.26235141,
         0.09629069, -0.25362792,  0.08388509],
       [ 0.56945945, -0.32089855, -0.01191282,  0.16279286, -0.12242081,
         0.20463971,  0.06552533, -0.45218926,  0.48184669, -0.03306343,
        -0.19447819,  0.09378035, -0.02403586],
       [-0.05313991,  0.28713718, -0.29559965, -0.11745816, -0.19289474,
         0.05041209, -0.18373639,  0.32567103,  0.52807345, -0.10000492,
         0.13941522,  0.44166539,  0.36140684],
       [-0.60587159, -0.22118309,  0.47026477,  0.27787284, -0.28308109,
        -0.05721864,  0.2368474 , -0.07627368,  0.24482732,  0.08684879,
        -0.16223423,  0.18588526,  0.10934233],
       [ 0.21179121, -0.43719284,  0.52257928, -0.23767832,  0.11645934,
        -0.07883476, -0.4505751 ,  0.43190873,  0.08911163, -0.05791398,
        -0.00926022, -0.05225958,  0.10475837],
       [-0.00537511, -0.50143723, -0.44624179,  0.45385663, -0.21058113,
         0.16634978,  0.06820882,  0.48334972, -0.15986944, -0.01409448,
         0.01326646, -0.05518489, -0.06434218],
       [ 0.05524205,  0.12381945, -0.04800361,  0.28534965, -0.14154604,
        -0.26013343, -0.16101089, -0.09961165,  0.04815939,  0.01737458,
        -0.06241862, -0.6068964 ,  0.63205703],
       [-0.0271613 , -0.29358151, -0.14398474, -0.50016741,  0.23185821,
         0.01377701,  0.64191887,  0.10274066,  0.08959784,  0.02533259,
        -0.05275859, -0.13672065,  0.36805447]])
In [70]:
pca_model.explained_variance_
Out[70]:
array([5.51728612, 2.92357046, 1.18074492, 1.11301483, 0.76837701,
       0.67710708, 0.31577685, 0.20308193, 0.14402132, 0.0535837 ,
       0.04802619, 0.03565988, 0.03513434])
In [71]:
pca_model.explained_variance_ratio_
Out[71]:
array([0.42390496, 0.22462421, 0.09071917, 0.08551532, 0.05903606,
       0.05202359, 0.02426181, 0.01560322, 0.01106547, 0.00411695,
       0.00368996, 0.00273983, 0.00269945])
In [72]:
e_variance = pca_model.explained_variance_
range_ = list(range(1,std_sh_drop_df.shape[1]+1))

plt.plot(range_, e_variance)
Out[72]:
[<matplotlib.lines.Line2D at 0x227cc74b518>]
In [73]:
e_variance_ratio = pca_model.explained_variance_ratio_
cum_var_exp = np.cumsum(e_variance_ratio)

e_variance_ratio, cum_var_exp
Out[73]:
(array([0.42390496, 0.22462421, 0.09071917, 0.08551532, 0.05903606,
        0.05202359, 0.02426181, 0.01560322, 0.01106547, 0.00411695,
        0.00368996, 0.00273983, 0.00269945]),
 array([0.42390496, 0.64852917, 0.73924834, 0.82476367, 0.88379972,
        0.93582332, 0.96008513, 0.97568835, 0.98675382, 0.99087077,
        0.99456073, 0.99730055, 1.        ]))
In [74]:
with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(8, 6))

    plt.bar(range(len(e_variance_ratio)), e_variance_ratio, alpha=0.5, align='center',
            label='individual explained variance')
    plt.step(range(len(e_variance_ratio)), cum_var_exp, where='mid',
             label='cumulative explained variance')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()
In [75]:
corr_df = pd.DataFrame(pca_model.components_)

corr_df
Out[75]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 -0.366701 -0.356019 -0.398894 -0.379199 -0.175661 -0.282251 -0.379147 -0.311355 0.124451 -0.074981 -0.086212 -0.129255 -0.198510
1 0.008978 0.257023 0.069797 -0.079389 -0.188802 -0.007558 0.188682 0.331745 0.469063 -0.009324 -0.089716 -0.517031 -0.495830
2 -0.176652 0.026173 -0.106325 0.292482 0.675005 -0.101879 -0.051205 0.049369 0.204753 -0.467093 -0.370950 0.020707 -0.045348
3 -0.066611 -0.110199 0.121773 0.054446 -0.045439 0.078193 0.108116 -0.151804 0.057876 -0.628931 0.710320 -0.138243 -0.012013
4 0.006648 -0.125303 -0.075558 0.221401 0.388275 -0.382507 0.017767 -0.020698 0.295573 0.537214 0.492513 0.029938 -0.110842
5 -0.294789 -0.014684 -0.015572 -0.028228 0.247007 0.779765 -0.257325 -0.100441 0.130501 0.262351 0.096291 -0.253628 0.083885
6 0.569459 -0.320899 -0.011913 0.162793 -0.122421 0.204640 0.065525 -0.452189 0.481847 -0.033063 -0.194478 0.093780 -0.024036
7 -0.053140 0.287137 -0.295600 -0.117458 -0.192895 0.050412 -0.183736 0.325671 0.528073 -0.100005 0.139415 0.441665 0.361407
8 -0.605872 -0.221183 0.470265 0.277873 -0.283081 -0.057219 0.236847 -0.076274 0.244827 0.086849 -0.162234 0.185885 0.109342
9 0.211791 -0.437193 0.522579 -0.237678 0.116459 -0.078835 -0.450575 0.431909 0.089112 -0.057914 -0.009260 -0.052260 0.104758
10 -0.005375 -0.501437 -0.446242 0.453857 -0.210581 0.166350 0.068209 0.483350 -0.159869 -0.014094 0.013266 -0.055185 -0.064342
11 0.055242 0.123819 -0.048004 0.285350 -0.141546 -0.260133 -0.161011 -0.099612 0.048159 0.017375 -0.062419 -0.606896 0.632057
12 -0.027161 -0.293582 -0.143985 -0.500167 0.231858 0.013777 0.641919 0.102741 0.089598 0.025333 -0.052759 -0.136721 0.368054

4. Use Support vector machines and use grid search (try C values - 0.01, 0.05, 0.5, 1 and kernel = linear, rbf) and find out the best hyper parameters and do cross validation to find the accuracy. (7.5 points)

SVM Approach - 1

In [76]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
In [77]:
parameter_candidates = [
  {'C':[0.01, 0.05, 0.5, 1], 'kernel': ['linear']},
  {'C':[0.01, 0.05, 0.5, 1], 'kernel': ['rbf']}
]
In [78]:
X = std_sh_df
y = sh_df['class']
# Create a classifier object with the classifier and parameter candidates
clf = GridSearchCV(estimator=SVC(), param_grid=parameter_candidates, cv = 5)

# Train the classifier on data1's feature and target data
clf.fit(X, y)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
  (the FutureWarning above is repeated once per fit during the grid search)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[78]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=None,
             param_grid=[{'C': [0.01, 0.05, 0.5, 1], 'kernel': ['linear']},
                         {'C': [0.01, 0.05, 0.5, 1], 'kernel': ['rbf']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [79]:
# Let’s look at the best cross-validated accuracy found by the grid search
# (best_score_ is the mean CV accuracy of the best parameter combination)
print('Best score for data1:', clf.best_score_)
Best score for data1: 0.966903073286052
In [80]:
# Which parameters are the best? We can tell scikit-learn to display them:
# View the best parameters for the model found using grid search
print('Best C:',clf.best_estimator_.C) 
print('Best Kernel:',clf.best_estimator_.kernel)
Best C: 1
Best Kernel: rbf
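The wall of FutureWarnings above comes from leaving `gamma` unset on `SVC`. A minimal sketch of the same grid search with `gamma` set explicitly so it runs cleanly (using the iris dataset as a toy stand-in for the vehicle features; `clf_demo` is a hypothetical name):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)  # toy stand-in for the vehicle data

# Passing gamma explicitly ('scale' or 'auto') silences the deprecation warning
param_candidates = [{'C': [0.01, 0.05, 0.5, 1], 'kernel': ['linear']},
                    {'C': [0.01, 0.05, 0.5, 1], 'kernel': ['rbf']}]
clf_demo = GridSearchCV(estimator=SVC(gamma='scale'),
                        param_grid=param_candidates, cv=5)
clf_demo.fit(X_demo, y_demo)
print(clf_demo.best_params_)
```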
In [81]:
# Repeating the above steps after dropping columns with correlation > .95
X = std_sh_drop_df
y = sh_df['class']
# Create a classifier object with the classifier and parameter candidates
clf = GridSearchCV(estimator=SVC(), param_grid=parameter_candidates, cv = 5)

# Train the classifier on data1's feature and target data
clf.fit(X, y)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[81]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=None,
             param_grid=[{'C': [0.01, 0.05, 0.5, 1], 'kernel': ['linear']},
                         {'C': [0.01, 0.05, 0.5, 1], 'kernel': ['rbf']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [82]:
# Let’s look at the best cross-validated accuracy found by the grid search
print('Best score for data1:', clf.best_score_)
Best score for data1: 0.9491725768321513
In [83]:
# Which parameters are the best? We can tell scikit-learn to display them:
# View the best parameters for the model found using grid search
print('Best C:',clf.best_estimator_.C) 
print('Best Kernel:',clf.best_estimator_.kernel)
Best C: 1
Best Kernel: rbf
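The `std_sh_drop_df` used above drops one column from each feature pair whose correlation exceeds 0.95 (that step appears earlier in the notebook). A minimal sketch of how such a drop can be done, on a hypothetical `demo_df` where column `b` is nearly a linear copy of `a`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo_df = pd.DataFrame({'a': a,
                        'b': a * 2 + rng.normal(scale=0.01, size=100),
                        'c': rng.normal(size=100)})

# Upper triangle of the absolute correlation matrix (avoid counting pairs twice)
corr = demo_df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column correlated > 0.95 with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
demo_drop_df = demo_df.drop(columns=to_drop)
print(to_drop)
```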
In [84]:
# Use SVC with the reduced feature set from PCA
X = X_test_reduced
y = sh_df['class']
# Create a classifier object with the classifier and parameter candidates
clf = GridSearchCV(estimator=SVC(), param_grid=parameter_candidates, cv = 5)

# Train the classifier on data1's feature and target data
clf.fit(X, y) 

# View the best cross-validated accuracy score
print('Best score for data1:', clf.best_score_)

# Which parameters are the best? We can tell scikit-learn to display them:
# View the best parameters for the model found using grid search
print('Best C:',clf.best_estimator_.C) 
print('Best Kernel:',clf.best_estimator_.kernel)
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Best score for data1: 0.9491725768321513
Best C: 1
Best Kernel: rbf

SVM Approach - 2

In [85]:
from sklearn.model_selection import train_test_split

# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score, confusion_matrix

target = sh_df["class"]
features = sh_df.drop(["class"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(features, target, stratify=target, test_size = 0.2, random_state = 10)
In [118]:
print(X_train.shape)
print(X_test.shape)
(676, 18)
(170, 18)
In [119]:
print(y_train.shape)
print(y_test.shape)
(676,)
(170,)
In [120]:
# use from sklearn.svm import SVC
from sklearn.svm import SVC

# Building a Support Vector Machine on the training data
# (note: gamma has no effect when the kernel is linear)
svc_model = SVC(C= .1, kernel='linear', gamma= 1)
svc_model.fit(X_train, y_train)

prediction = svc_model.predict(X_test)
In [121]:
# check the accuracy on the training set
print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))
0.9763313609467456
0.9411764705882353
In [122]:
print("Confusion Matrix:\n",confusion_matrix(prediction,y_test))  # args are (pred, true), so rows are predictions
Confusion Matrix:
 [[38  3  0]
 [ 3 82  0]
 [ 3  1 40]]
In [123]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(features)
In [124]:
data_scaled= scaler.transform(features)
In [125]:
data_scaled
Out[125]:
array([[0.47826087, 0.57692308, 0.59722222, ..., 0.3902439 , 0.36666667,
        0.53333333],
       [0.39130435, 0.30769231, 0.61111111, ..., 0.34146341, 0.43333333,
        0.6       ],
       [0.67391304, 0.65384615, 0.91666667, ..., 0.2195122 , 0.4       ,
        0.5       ],
       ...,
       [0.7173913 , 0.80769231, 0.84722222, ..., 0.09756098, 0.36666667,
        0.66666667],
       [0.2826087 , 0.11538462, 0.52777778, ..., 0.6097561 , 0.46666667,
        0.46666667],
       [0.26086957, 0.11538462, 0.36111111, ..., 0.43902439, 0.33333333,
        0.3       ]])
In [126]:
X_train, X_test, y_train, y_test = train_test_split(data_scaled, target, stratify=target, test_size = 0.2, random_state = 10)
In [127]:
# Building a Support Vector Machine on train data
svc_model = SVC(C= .1, kernel='linear', gamma= 1)
svc_model.fit(X_train, y_train)

prediction = svc_model.predict(X_test)
In [128]:
# check the accuracy on the training set
print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))
0.8505917159763313
0.8
In [129]:
print("Confusion Matrix:\n",confusion_matrix(prediction,y_test))
Confusion Matrix:
 [[28  3  5]
 [12 73  0]
 [ 4 10 35]]
In [130]:
# Building a Support Vector Machine on train data
svc_model = SVC(C= 1000, kernel='linear', gamma= 1)
svc_model.fit(X_train, y_train)

prediction = svc_model.predict(X_test)
In [131]:
# check the accuracy on the training set
print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))
0.9792899408284024
0.9411764705882353
In [132]:
print("Confusion Matrix:\n",confusion_matrix(prediction,y_test))
Confusion Matrix:
 [[39  3  0]
 [ 3 82  1]
 [ 2  1 39]]

Increasing C allowed us to improve the model.
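That effect of C can be sketched in isolation: a very small C regularizes heavily (wider margin, more training error), while a large C fits the training data more tightly. A toy comparison (iris as a stand-in; `accs` is a hypothetical name):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

accs = {}
for C in (0.001, 1000):
    model = SVC(C=C, kernel='linear', gamma='scale')
    model.fit(X_demo, y_demo)
    accs[C] = model.score(X_demo, y_demo)  # training accuracy
    print(C, round(accs[C], 3))
```

On the vehicle data the same trade-off explains the jump from C=0.1 to C=1000 above; very large C can also overfit, which is why the held-out score should be checked as well.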

In [133]:
import multiprocessing 
from sklearn.model_selection import GridSearchCV
In [134]:
param_grid = [{'kernel': ['linear', 'rbf'],
               'C': [0.01, 0.05, 0.5, 1]}]
In [135]:
gs = GridSearchCV(estimator=SVC(), param_grid=param_grid,scoring='accuracy', cv=10, n_jobs=multiprocessing.cpu_count())
In [136]:
gs.fit(X_train, y_train)  # X_train/y_train here come from the scaled 80/20 split above
C:\Users\ABHIJEET\Anaconda4\lib\site-packages\sklearn\model_selection\_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
Out[136]:
GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=4,
             param_grid=[{'C': [0.01, 0.05, 0.5, 1],
                          'kernel': ['linear', 'rbf']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
In [137]:
# hyper parameters
gs.best_estimator_
Out[137]:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [138]:
gs.best_score_ 
Out[138]:
0.908284023668639

Cross Validation

In [139]:
# Building a Support Vector Machine with best hyper parameters
svc_model = SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
    
svc_model.fit(X_train, y_train)

prediction = svc_model.predict(X_test)
In [140]:
# check the accuracy on the training set
print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))
0.9215976331360947
0.8764705882352941
In [141]:
print("Confusion Matrix:\n",confusion_matrix(prediction,y_test))
Confusion Matrix:
 [[36  6  3]
 [ 5 76  0]
 [ 3  4 37]]
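The single 80/20 split above can be complemented with k-fold cross-validation on the whole training set, which is what the grid search does internally. A minimal standalone sketch using `cross_val_score` (iris as a toy stand-in for the scaled features):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

# 10-fold CV on the tuned linear SVM; the mean estimates generalization,
# the standard deviation shows how stable that estimate is across folds
scores = cross_val_score(SVC(C=1, kernel='linear', gamma='scale'),
                         X_demo, y_demo, cv=10)
print(scores.mean().round(3), scores.std().round(3))
```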

The SVM model was applied using two approaches. Approach 1 determined programmatically that, of the two kernels tried (linear and RBF), RBF performed best, with a best cross-validated score close to 96%.

In Approach 2, the linear and RBF kernels were applied separately after splitting the data in an 80/20 ratio, and cross-validation was implemented as well.